Statistical Mechanics of Learning: A Variational Approach for Real Data 
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Using a variational technique, we generalize the statistical physics approach of learning from random exam- 
ples to make it applicable to real data. We demonstrate the validity and relevance of our method by computing 
approximate estimators for generalization errors that are based on training data alone. 
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In recent years, methods of statistical physics have con- 
tributed important insights to the theory of learning from ex- 
ample data with neural networks and other learning machines 
(for recent reviews see e.g. [[l], |||). Such systems are often 
described by models with a large number of degrees of free- 
dom which interact by a random energy function where the 
randomness in a learning problem is induced by the data. Ex- 
act solutions for the learning performance of a great variety of 
models have been obtained using tools which were developed 
for the physics of disordered materials, such as the replica 
trick All these case studies assume simple distributions 
of the data and give invaluable insights into the generic learn- 
ing behavior. However, they do not (and did not intend to) 
describe real world learning experiments where one has the 
additional problem that the distribution of the data is often un- 
known and idealized assumptions about it seem to be too re- 
strictive. It would be highly desirable if the statistical physics 
theory could provide practitioners with tools to predict and 
optimize the performance of a learning algorithm. 

In this Letter, we will present a step forward in this direc- 
tion. We combine the replica approach with a variational ap- 
proximation which enables us to deal with more realistic dis- 
tributions of the randomness. For models of supervised learn- 
ing |]l|], we demonstrate that the theory can predict relations 
between experimentally measurable quantities even though 
details of the data distribution are unknown. As an applica- 
tion, we will derive approximate expressions for data averaged 
performance measures for state of the art learning algorithms 
and test them on real data sets. We hope that our approach 
will inspire similar research in related complex adaptive sys- 
tems such as, for example, communication systems [|[|. 

We demonstrate the basic idea of our approach on a typi- 
cal learning scenario where we try to learn an unknown rule 
which maps inputs x E R d to outputs y from a set D of 
m example data pairs (xi, j/j), i — 1, . . . , m. The possible 
rules are modeled by a class of real valued functions f{x). 
To get a reasonable predictor f(x\D) on arbitrary novel in- 
puts x, a popular learning strategy is to balance the good- 
ness of fit on the training data measured by a training energy 
E[f; D] = Y2iLi Vi) w i tn the prior knowledge about 

the complexity of rules. In a probabilistic, Bayesian approach 
to learning nil EL both ingredients are combined in a proba- 



bility distribution 



(1) 



which is of the form of a Gibbs equilibrium distribution in 
statistical physics. It assigns different weights to functions 
/ of being responsible for the observed training examples 
D. /j,[f] encodes the prior knowledge about the plausibility 
of different functions. Proper choices of n[f] will penalize 
functions which fit the data well but are too complex to gen- 
eralize properly on novel inputs x. Parametric models ex- 
press / by a set of parameters w, which may , e.g., be the 
weights of a neural network. The distribution fi[f] is then in- 
duced by a distribution over w. Non-parametric models are 
obtained by assigning an a priori statistical weight di- 
rectly over the space of functions. Predictions for the out- 
put y on a novel input x can be made by suitably averag- 
ing the function value f(x) weighted by the distribution ([j]), 
where in the course of learning, when more and more data are 
observed, typically fi m [/] will become increasingly concen- 
trated around its mean. For continuous outputs (regression), 
we predict y = f(x\D) — (f(x)), and for binary classifi- 
cation problems, we predict y = sign[(/(x))], where angle 
brackets denote averages over ([!]). 

We will now specialize on models which are defined by 
Gaussian prior distributions //[/] over functions. Assuming a 
zero mean, they are fully specified by the correlation kernel 
K(x,x') = / d/j.[f] {f(x)f(x')} which must be supplied by 
the user. It encodes a priori assumptions about the typical vari- 
ability of model functions / with the input x. Such Gaussian 
process models (GP) have attracted considerable attention in 
recent years as they represent a flexible and widely applicable 
concept [@, 0]- GP models can be understood as a limit 
of Bayesian feed-forward neural networks when the number 
of hidden units grows to infinity Non-probabilistic "ker- 
nel machines" such as the celebrated support vector machines 
(SVMs) [|J] result from GP models by taking appropriate lim- 
its. GP also form a basis for field theoretic approaches to den- 
sity estimation [Q]. For all GP models, the mean predictions 
are expressed as a linear combination of kernel functions cen- 
tered at the input data (f(x)) = Y^JLi &jK(x, Xj), where the 
cvj 's are certain Gibbs averages which are independent on the 
points §§,|]. 

We study the typical learning performance of this kernel ap- 
proach by averaging over different drawings of training data 
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sets D where all examples (xi,yi) are generated indepen- 
dently from the same distribution p(y, x) = p(y\x)p(x). We 
use the replica approach for computing the averaged free en- 
ergy F = [— lnZ m ]r) which serves as a generating function 
for useful data averages. Here, we denote data averages by 

square brackets. We get F = — lim„^o 81 "^T^" P ■ To facil- 
itate subsequent calculations, we use a grand canonical for- 
mulation where the number of examples m is only fixed on 
average by a chemical potential /i. An elementary calcula- 
tion which uses the independence of the data, yields the grand 
canonical partition function for the n times replicated system 



S„(/i) 
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m=0 



^K] D = J ]]>[/«] exp[-H] (2) 



in terms of a Hamiltonian H = [H({f a ], x)] x which is the 
average of a purely local Hamiltonian density 



H({f a y,x) = - t 



exp< - ^2h(f a (x),y) 



(3) 



{y[- 



Here, the square brackets [. . ] x and [. . .] y \ x denote expecta- 
tions with respect to the distribution of inputs x and with re- 
spect to the conditional distribution of outputs y given the in- 
puts. The chemical potential /i is adjusted such that m = 
for all n which gives the simple result /i = In m in 
0. For sufficiently large m, we replace the sum 



the limit n 



over m in Eq. (|2|) by its dominating term and recover the orig- 
inal (canonical) free energy as F = — lim n _,o 8 — "g^ ln ■ 

Rather than evaluating (||) for simple and artificial distri- 
butions p(x, y) in the limit of high input dimensionality, we 
resort to a variational approximation which can adapt to prac- 
tically relevant situations. A similar approach was found to be 
useful in studying low dimensional disordered systems such 
as polymers or interfaces in random media JlO|]. We approx- 
imate H by a trial replica Hamiltonian Hq which minimizes 
the variational bound (see e.g. JTT[]) 



p n 

lnS„( M ) < -my IpM/J 
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The brackets (. . .)o denote an average with respect to the dis- 
tribution induced by n™=i lAfa\ e ~ H °- Equation (Q) is the 
leading term in a systematic perturbation expansion of the free 
energy with respect to the difference H — Hq. For Gaussian 
measures fi[f], a local trial Hamiltonian of the form 



H = 



^2Qab{x)fa(x)fb(x) + ^2Ra(x)fa{x) 

a<b a 



(5) 



is an appropriate choice. The most important feature of (|j) is 
the explicit dependence of the variational parameters Q a b(x) 
and R a (x) on the input variable x . This enables us to take the 
nonuniformity of realistic input densities into account. The re- 
sulting Gaussian approximation is expected to become asymp- 
totically exact for training energies h(f, y) that are smooth 



functions of /, when the Gibbs distribution ([I]) becomes in- 
creasingly concentrated around its mean for large ra. 

Expressing the averaged Hamiltonian (H)q by local order 
parameter fields R a (x) = (f a (x))o and Q a b{x, x) [the latter 
being a special case of the two point function Q a b(x,x') — 
(fa(x)fb(x'))o], a straightforward variation yields 



d(H) 
dR a (x) 



= Ra(x) 



d(H) c 



dQ ab (x,x) 



Qab(x) . (6) 



Assuming replica symmetry, i.e., R a (x) = R(x), 
Q ab {x,x') = Q(x,x') for a ^ b and Q aa (x,x') = 
Qo(x,x'), the order parameter fields have a simple phys- 
ical meaning in terms of averages over example data sets 
D. We have Q(x,x') = [(f(x))(f(x'))] D and Q (x,x') = 
[(f(x)f(x'))] D . G(x,x>) = [(f(x)f(x')) - (f(x))(f(x'))] D 
is the average posterior correlation function. R(x) = 
[(f(x))]n is the bias of the predictor, whereas V(x,x') = 
Q(x, x') — R(x)R(x') is its data covariance. A direct calcu- 
lation of order parameters from (Q) expresses these in terms 
of the variational parameters as R(x) = —{G(x,x')R(x')] x ' 
andV(x,x') = -[G{x,x")G(x' \x")Q{x")] x „. Finally, G is 
found as the operator inverse G = (K^ 1 + u) , where K 
is the kernel integral operator and u(x 1 x') = p(x)(Qo(x) — 
Q(x))S(x - x'). 

There are various ways to make use of our variational 
framework. We can use the probability measure defined by 
the trial Hamiltonian (||) in order to compute the average case 
performance of the learning algorithm at test points (x, y) not 
contained in the data set D. For example, the data average of 

the (mean square) prediction error £2 = [((f( x )) — y) 2 ] x .y 
is [e2{D)]d = [(R(x) — y) 2 + V(x,x)] XtV . Its sample fluc- 
tuations are [e 2 (D) 2 ] D - [e{D)f D = [4(R(x) - y)(R(x') - 
y')V(x, x') + 2V 2 (x, x')] x ^ x i :V: yi . Similar to previous studies 
in the statistical mechanics of learning [[j], ||], we could com- 
pute explicit learning curves from the variational equations, 
when the distribution of data is given. In general, explicit an- 
alytical solutions are possible for simple distributions, or in 
the asymptotic limit when the number of data grows large. 
We will give examples of such results elsewhere. Since in 
practice, however, data distributions are usually unknown, a 
more important application of our approach is in the deriva- 
tion of explicit relations between data averaged observables, 
which will hold (within the framework of our approximation) 
for any such distribution. Such relations can help to estimate 
the performance of an algorithm on novel data which were not 
contained in the training set D. 

To demonstrate this idea, we consider the problem of find- 
ing empirical estimates for generalization errors on novel test 
data (x,y). These estimates should be computable from the 
training data D only. Since in many real applications, test er- 
rors may be measured with an error function different from 
the original training energy h, we consider general error func- 
tions L((f(x)),x,y). Obviously, using the naive approxi- 
mation ^ YliLi L{(fi), Xi,yi) to estimate the expected error 
e L (D) = [L({f(x)),x,y)] x , y , where = /(x;), will give an 
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FIG. 1: Model selection based on Eq. (g) for a regression problem 
(Boston housing) and a binary classification problem (Sonar) using 
the kernel K(x,x') — exp( — ||x — x'\\ 2 /l 2 ). The right hand side 
of Eq. (^) is an empirical estimate (crosses) which is calculated only 
on training data. It gives a good account of the generalization error 
(circles) which was calculated on test data. 



optimistically biased estimate in most cases. In order to com- 
pute better, unbiased estimates for test errors which are based 
on the training data, we use the general result 
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Dz [g{{Fi{<l>))<t>, •••) (Fk(<i>))<j„x,y)}x,y 



where Dz = e z ' ' l 2 dz j '^/2tt and (. 
tion with respect to the distribution 



denotes an expecta- 



exp 



(R{x)+z^V(x x)-^) 2 
2G(x,x) 



y/2nG(x,x) Z(x,y,z) 



(8) 



with norm Z(x,y,z). Eq. (Q) is easily proved within our 
variational framework and holds for arbitrary functions g, 
Fi, . . . , Fk [p^|. To compute the desired unbiased estimates, 
we try to find a set of functions g and F\ , . . . Fk such that the 
right hand side of (fy can be rewritten as the data average of 
sl(D). 

We will demonstrate this idea for the GP model with mean 
square training error h2(f(x),y) = ^z(f( x ) ~ J/) 2 - This 
model finds widespread applications [Q, g| and has the ad- 
vantage that the computations of Gibbs averages ([j]) such as 
means and correlation functions can be performed analytically 
in closed form. Nevertheless, the analysis of the data aver- 
age is nontrivial, because averaging leads to a non Gaussian 
model in replica space which makes the application of our 
variational approximation still necessary. Using Fi(f) = f 
and F 2 (f) = f in Eq. (fl) a s well as [L((f(x)),x,y)] D = 
J Dz [L(R(x) + z^JV{x,x),x,y)] x .y yields 



J-l I „ n ! x l , y t 



(9) 



where a\ = (ff) — (fi) 2 - Generalizations to other GP mod- 
els and support vector machines are possible and will be given 
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FIG. 2: Verification of Eq. ^ on two benchmark data sets for 
binary classification with input dimension d — 7 (Pima Indian dia- 
betes, stars) and d = 60 (Sonar data, squares). The arrow points into 
the direction of increasing number m of training data with m = 4 — 7 
in steps of 1, m = 9 — 24 in steps of 5 and m = 33, 50, 100. 



elsewhere. We present tests of Eq. ^ on regression and bi- 
nary classification problems. In both cases (f(x)) is com- 
puted from the squared error h 2 with a 1 = 0.01 (Boston data) 
and er 2 = 0.1 (Sonar data), respectively. The left panel of 
Fig. [j] compares the generalization errors and their estimates 
for the publicly available Boston housing regression data set 
Jl3|] using the error function L((f(x)) , x, y) = \ (f(x)) — y\ h 
for k = 1, m = 50 and different widths I 2 of the kernel 
K(x,x') = exp( — ||x — x'|| 2 /Z 2 ). The results (which are av- 
eraged over 20 splits of the entire data set into training and test 
examples) suggest that our estimators might be well used for 
model selection, i.e., for finding the optimal kernel parame- 
ters with smallest generalization error. It is interesting to note 
that the case k — 2 leads to the exact leave one out estima- 
tor for the square error described in [Q]. The right panel of 
Fig. |l] shows corresponding results for binary classification 
(y = ±1) on the Sonar data, where now the generalization 
error is the average fraction of misclassified test points, i.e., 
L((f(x)),x,y) = Q(—y(f(x))), and 0(x) is the unit step 
function. 

In the following, we will test the validity of our variational 
replica approach for a case where the combination of a non- 
smooth training energy with a subtle singular limit might not 
suggest immediately that the trial Gaussian distribution (5) 
is a good approximation. We consider SVM classifiers [8] 
which can be understood as generalizations of single layer 
neural networks which allow for nonlinear separation be- 
tween classes. Expanding the positive definite SVM kernel 
K(x,x') = ^kipk^ipkix') in a set of implicit features 
ipk(x), the SVM output can be written as y = sign[/(a;)] 
where f(x) — J^k w kipk(x). The vector w of SVM weights 
Wi is determined by minimizing its length ||w|| under the 
condition that the "hard margin" training energy Hhm = 0. 
The latter is defined as h,HM{f{x),y) = if yf(x) > 1 
and hiiM = 00 otherwise. This SVM model is contained 
within the present GP framework [S] by introducing a Gaus- 
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sian prior distribution of variance e independently for each 
weight Wk- In the the limit e — > the posterior Gibbs dis- 
tribution becomes concentrated at the minimal length weight 
vector of the SVM. The Gaussian distribution over weights is 
equivalent to a Gaussian process over functions / with cor- 
relation kernel K t (x,x') — eK(x, x'). We test our method 
on the SVM margin k(D) = l/||w|| between positive and 
negative labeled examples which plays an important role as 
an indicator for good generalization ability of SVMs ||]. 
It can be shown that k(D) is related to the free energy as 
= [1/ k2 (£*)].d = 21im e _,o cF e . Using our variational 
approximation we get 



my/V(x, x) 
X(x,x) 



A(x,y) 



Dz{A(x,y) 



(10) 



x,y 



where A(x, y) 



i-yR(x) 



and x{x,x') 



lim e ^o e ~ l G e (x, x') is a response function which can 
be computed using the SVM algorithm. Equation ( |Tfj[ ) 
relates the averaged inverse squared margin on the training 
data to functions of the SVM's bias, its variance and the 
response function \ on novel test data (x,y). It generalizes 
Gardner's famous result jl4| ] for the optimal stability of 
linear perceptrons to SVMs under general data distributions. 
Figure ^ compares as computed from its basic definition, 
with our prediction given by Eq. for two publicly 

available classification data sets [Ol]. We used the kernel 
K(x,x') = exp(-||x - x'\\ 2 /l 2 ) Figure | is obtained 
for suboptimal choices of the correlation length I and different 
sizes m of the training data sets D. For each m we have 
performed an average over 20 splits of the entire data set 
into training and test examples. Our variational Gaussian 
approximation improves with growing training size m and 
higher data dimensionality. The accuracy of our result ( |io| ) 
improves further, if the kernel width I is optimized to achieve 
small generalization errors. 

The two examples presented so far are only a small selec- 
tion of the possible applications of our approach. Obvious 
future extensions will include an assessment of the reliability 
of estimators Eq. ([)]) for model selection by taking their sam- 
ple fluctuations into account. Also alternative, Bayesian cri- 
teria which use the minimization of the free energy for model 
selection fit naturally into the statistical physics framework. 
Generalizing the variational approach to models with statisti- 



cally dependent data such as e.g. (hidden) Markov processes, 
which are relevant for time series prediction, and to non Gaus- 
sian priors and trial Hamiltonians, will extend its applicabil- 
ity to the wider field of modern probabilistic data modeling. 
Finally, it will be necessary to estimate and improve the ac- 
curacy of our approximations. This can be done by comput- 
ing perturbative corrections to the variational free energy (Q), 
but also in a nonperturbative framework by including the pos- 
sibility of replica symmetry breaking. While the latter phe- 
nomenon is not expected to be relevant for most of the inter- 
esting kernel models with convex error functions, applications 
of the method in other fields like combinatorial optimization 
might definitely benefit from this extension. 
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